Authors:
Sam Abbott, Bristol Medical School: Population Health Sciences, University of Bristol, Bristol, UK
Hannah Christensen, Bristol Medical School: Population Health Sciences, University of Bristol, Bristol, UK
Ellen Brooks-Pollock, Bristol Medical School: Population Health Sciences, University of Bristol, Bristol, UK
Correspondence to: Sam Abbott, Bristol Medical School: Population Health Sciences, University of Bristol, Bristol BS8 2BN, UK; sam.abbott@bristol.ac.uk; 01173310185
Words: Title: 12 Abstract: 202 Paper: 3196
The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance system, with a similar structure to other such systems, that collects data on all notified tuberculosis (TB) cases in England. It is routinely used to study the epidemiology of TB. Routine data often contain a large proportion of missing values, which may not be fully accounted for in analyses. This study explores the evidence for associations between missingness in several key outcomes and demographic variables. Any such associations may introduce bias if not accounted for.
Background
The Enhanced Tuberculosis Surveillance (ETS) system is a routine surveillance system, with a similar structure to other such systems, that collects data on all notified tuberculosis (TB) cases in England. It is routinely used to study the epidemiology of TB. Routine data often contain a large proportion of missing values, which may not be fully accounted for in analyses.
Detail
Describe the ETS and use cases
Missing data can take several forms: data that are missing completely at random (MCAR), data that are missing at random (MAR) and data that are missing not at random (MNAR).[1] Data that are MAR are missing with a mechanism that is conditional on observed variables, whilst data that are MNAR are missing with a mechanism that is conditional on variables that are not observed. Data that are MAR or MNAR may lead to biases when analysed; however, it is not possible to deduce from the observed data alone what mechanism is driving the missingness. Therefore, it is necessary to account for these potential biases during the analysis stage. This is possible using a variety of methods, such as scenario analysis accounting for the ‘best’ and ‘worst’ case scenarios, and multiple imputation of missing data using additional variables in the dataset to inform the imputation model.[1] Common practice is to include all variables used in the analyses in the imputation model; these variables may or may not be those most at risk of introducing bias due to an MAR mechanism.
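As an illustration of why an MAR mechanism matters, the following minimal sketch compares a complete-case estimate under MCAR and MAR. All counts, group labels and missingness proportions are invented for easy arithmetic; they are assumptions and are not taken from the ETS.

```python
# Hypothetical population: two observed groups with different outcome prevalence.
# All counts are invented for illustration; they are not ETS data.
population = {
    "A": {"n": 1000, "with_outcome": 100},  # 10% prevalence
    "B": {"n": 1000, "with_outcome": 500},  # 50% prevalence
}

def observe(population, prob_observed):
    """Keep a fixed fraction of each group, as if the rest were missing."""
    return {
        g: {"n": c["n"] * prob_observed[g],
            "with_outcome": c["with_outcome"] * prob_observed[g]}
        for g, c in population.items()
    }

def crude_prevalence(cells):
    """Complete-case estimate that ignores group membership."""
    total = sum(c["n"] for c in cells.values())
    return sum(c["with_outcome"] for c in cells.values()) / total

def adjusted_prevalence(cells, population):
    """Within-group prevalence, reweighted by the full group sizes."""
    total = sum(c["n"] for c in population.values())
    return sum(
        cells[g]["with_outcome"] / cells[g]["n"] * population[g]["n"]
        for g in population
    ) / total

truth = crude_prevalence(population)
mcar = observe(population, {"A": 0.5, "B": 0.5})  # missingness unrelated to group
mar = observe(population, {"A": 0.8, "B": 0.2})   # missingness depends on group

print(f"true prevalence:         {truth:.2f}")
print(f"complete case, MCAR:     {crude_prevalence(mcar):.2f}")
print(f"complete case, MAR:      {crude_prevalence(mar):.2f}")
print(f"MAR, adjusted for group: {adjusted_prevalence(mar, population):.2f}")
```

With these invented counts the complete-case estimate is unchanged under MCAR, shifts from 0.30 to 0.18 under MAR, and returns to 0.30 once the observed driver of missingness is accounted for. This is only possible because the driver (group) is itself observed for every record.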
Aim
This study aims to explore the evidence for associations between missingness in several key outcomes and demographic variables. Any such associations may introduce bias if not accounted for.
The ETS is a database that collects demographic, clinical, and microbiological data on all notified TB cases in England and is maintained by Public Health England (PHE). Notification is required by law, with health service providers having to inform PHE of all confirmed TB cases.[2] Data collection began in 2000 and was expanded, with additional variables, with the launch of a web-based system in 2008.[3] It is updated annually with de-notifications, late notifications and other updates. A descriptive analysis of TB epidemiology in England is published each year, which reports on data collection, cleaning, and trends in TB incidence at both national and sub-national levels.[2] Data on all notifications (114,820 notifications) from the ETS system from 2000 to 2015 were obtained from PHE via an application to the TB monitoring team. The code used for data cleaning is available as an R package (https://zenodo.org/badge/latestdoi/93072437).
As the ETS is aggregated across England from a variety of sources, missing data are inevitable. These take two forms: under-reporting of notified cases, for which there is some evidence in the literature,[4] and data missing for a notified case. The former is particularly problematic because, other than through comparative studies, the characteristics of those who are not notified are unknown. For variables with missing data within the dataset, the proportion missing can be calculated, but care must be taken to account for nested variables (such as cause of death being dependent on date of death). To account for this when estimating the proportion of missing data, we assumed that nested variables take the value of their parent variable when missing. This approach may be biased for rare outcomes (such as death in the ETS). For this reason we also estimated the proportion of missing data by first filtering on the top-level variables required for the nested variable to be defined and then computing the proportion of the remaining notifications that were missing data for the outcome of interest.
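The two ways of estimating missingness for a nested variable can be sketched as follows. The records, the field names (`outcome`, `date_of_death`) and the rule for records with a missing parent are assumptions made for illustration; the published R package may handle these cases differently.

```python
# Toy notifications (invented). "outcome" is the parent variable and
# "date_of_death" is only defined for cases that died.
notifications = [
    {"outcome": "Died", "date_of_death": "2012-03-01"},
    {"outcome": "Died", "date_of_death": None},
    {"outcome": None, "date_of_death": None},
] + [{"outcome": "Completed", "date_of_death": None} for _ in range(7)]

def fill_from_parent(record, parent="outcome", child="date_of_death",
                     defines_child="Died"):
    """Replacement approach: a missing nested value inherits the parent's
    value unless the parent says the nested value should exist."""
    value = record[child]
    if value is None and record[parent] != defines_child:
        value = record[parent]  # stays None when the parent is also missing
    return value

def prop_missing_replacement(records):
    """Proportion missing across all notifications after replacement."""
    filled = [fill_from_parent(r) for r in records]
    return sum(v is None for v in filled) / len(filled)

def prop_missing_filtered(records, parent="outcome", child="date_of_death",
                          defines_child="Died"):
    """Proportion missing among only those notifications for which the
    nested variable is defined."""
    eligible = [r for r in records if r[parent] == defines_child]
    return sum(r[child] is None for r in eligible) / len(eligible)

print(f"replacement: {prop_missing_replacement(notifications):.0%}")
print(f"filtered:    {prop_missing_filtered(notifications):.0%}")
```

In this toy example the replacement approach gives 20% missing while the filtered approach gives 50%, which shows how a rare parent outcome can mask poor completeness of the nested variable.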
Overview
Missing data may be MAR or MNAR, which may introduce biases into any analyses based on these data. Unfortunately, MNAR data cannot be detected, so bias from this source cannot be discounted. However, it is possible to detect potential MAR mechanisms from observed variables that may not be included in a model used for analysis. Here we describe a method for this and apply it to several key outcomes: drug resistance (any treatment), BCG status, year of BCG vaccination, date of death, cause of death, date of symptom onset, date of diagnosis, date of starting treatment and date of ending treatment.
We reformulated the problem as a logistic regression for each variable of interest, with the outcome being data completeness (complete/missing). This allows variables that are hypothesised to be related to missing data to be adjusted for and their independent impact on data completeness to be estimated. This approach does not account for missingness within explanatory variables.
Statistical details
We took the following steps:
1. For the variable of interest, create a new temporary binary variable, called data status, that is “Missing” when the variable of interest is missing and “Complete” when it is not. Specify “Complete” as the baseline.
2. For nested variables, exclude notifications that do not have the top-level outcome required by the variable of interest. An example of this is excluding cases that did not die, or that have a missing overall outcome, when investigating TB mortality.
3. Specify the hypothesised drivers of missingness for the variable of interest. These should be variables with a reasonable hypothesis for how they would drive missingness in the variable of interest. They must also be relatively complete, as this approach does not impute missing confounder data.
4. Fit a logistic regression model with the temporary data status variable as the outcome, adjusting for the hypothesised drivers of missingness.
5. Exponentiate the returned coefficients and confidence intervals so that they represent Odds Ratios (ORs).
6. Refit the model, dropping each variable in turn, and compare each reduced model with the full model using a likelihood ratio test.
7. Interpret the results, using the estimated size of the effect, the width of the confidence intervals and the size of the Wald and likelihood ratio test p values to determine which variables are related to missingness for the variable of interest. Evidence should be interpreted on a spectrum, rather than using arbitrary significance cut-offs.[5] To avoid issues of multiple testing, the level of evidence should be weighted based on the number of variables adjusted for and the number of outcomes explored.
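The steps above can be sketched in a few lines of standard-library Python. This is a hedged illustration rather than the authors' implementation, which is distributed as an R package: the counts, the two binary drivers (`male`, `child`), the hand-written Newton-Raphson fit and the 1.96 multiplier for a 95% interval are all assumptions made for the example.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_logistic(X, y, iterations=25):
    """Newton-Raphson fit: returns coefficients, standard errors, log-likelihood."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iterations):
        grad = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for j in range(k):
                grad[j] += (yi - p) * xi[j]
                for l in range(k):
                    info[j][l] += p * (1.0 - p) * xi[j] * xi[l]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    # Standard errors from the information matrix of the final Newton step
    se = [math.sqrt(solve(info, [float(i == j) for i in range(k)])[j])
          for j in range(k)]
    loglik = sum(
        yi * eta - math.log(1.0 + math.exp(eta))
        for xi, yi in zip(X, y)
        for eta in [sum(b * v for b, v in zip(beta, xi))]
    )
    return beta, se, loglik

# Steps 1-3: invented counts per stratum (male, child, n_missing, n_complete)
cells = [(0, 0, 40, 60), (1, 0, 30, 70), (0, 1, 75, 25), (1, 1, 70, 30)]
names = ["intercept", "male", "child"]
X, y = [], []
for male, child, n_missing, n_complete in cells:
    X += [[1.0, male, child]] * (n_missing + n_complete)
    y += [1] * n_missing + [0] * n_complete  # 1 = "Missing", 0 = "Complete"

# Step 4: full model
beta, se, loglik_full = fit_logistic(X, y)

odds_ratios, wald_p, lrt_p = {}, {}, {}
for j in (1, 2):
    # Step 5: exponentiate the estimate and an assumed 95% interval
    odds_ratios[names[j]] = tuple(
        math.exp(v) for v in (beta[j], beta[j] - 1.96 * se[j], beta[j] + 1.96 * se[j])
    )
    wald_p[names[j]] = math.erfc(abs(beta[j] / se[j]) / math.sqrt(2.0))
    # Step 6: drop the variable, refit, and compare by likelihood ratio test
    reduced = [[v for i, v in enumerate(row) if i != j] for row in X]
    _, _, loglik_reduced = fit_logistic(reduced, y)
    statistic = 2.0 * (loglik_full - loglik_reduced)
    lrt_p[names[j]] = math.erfc(math.sqrt(statistic / 2.0))  # chi-squared, 1 df

for name, (est, low, high) in odds_ratios.items():
    print(f"{name}: OR {est:.2f} ({low:.2f}, {high:.2f}), "
          f"Wald p = {wald_p[name]:.3g}, LRT p = {lrt_p[name]:.3g}")
```

Each dropped variable here is a single binary column, so the likelihood ratio test has one degree of freedom; a categorical driver with several levels would need all of its indicator columns dropped together and a chi-squared reference distribution with the matching degrees of freedom.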
For all outcomes considered we adjusted for the same set of demographic variables, chosen because they were highly complete, plausibly linked to missingness for all outcomes considered, and likely to be present in other comparable surveillance datasets. These were: year, sex, age (grouped as 0-14, 15-44, 45-64 and 65+ years), ethnic group, UK birth status, socio-economic status (national quintiles), and PHE centre (region). Complete case analysis was used, with the dataset limited to notifications from 2010 onwards as socio-economic status was not collected prior to this. The code for this approach is available as an R package online (https://doi.org/10.5281/zenodo.3492200).
In addition to data being MAR, other biases may be present. This is a particular issue for date variables, where recall bias, reporting bias and similar problems may distort temporal trends. We explored this by summarising the distribution of all date variables by month and then by day of the month, stratified by the introduction of the web-based ETS system (2009). The date of notification was then used as a baseline for the inherent seasonal or monthly reporting structure. This approach allows potential biases to be identified and compared across the current and pre-web ETS. For each date variable, only data from 2000 until 2015 were used.
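A minimal sketch of this summary is given below. The dates are invented, and treating 2009 as the first year of the web-based stratum is an assumption taken from the stratification described above; the helper names are hypothetical.

```python
from collections import Counter
from datetime import date

WEB_ETS_YEAR = 2009  # assumed first year of the web-based ETS stratum

def era(d):
    """Label a date as belonging to the pre-web or web-based ETS."""
    return "web" if d.year >= WEB_ETS_YEAR else "pre-web"

def proportions(dates, part):
    """Proportion of dates per value of `part` ("month" or "day"), by era."""
    counts = {}
    for d in dates:
        if 2000 <= d.year <= 2015:  # restrict to the study period
            counts.setdefault(era(d), Counter())[getattr(d, part)] += 1
    return {
        e: {value: n / sum(c.values()) for value, n in sorted(c.items())}
        for e, c in counts.items()
    }

# Invented dates: symptom onset heaps on the 1st, notification does not.
onset = [date(2005, 1, 1), date(2005, 1, 1), date(2006, 6, 14), date(2007, 3, 9),
         date(2011, 1, 1), date(2012, 7, 20), date(2013, 5, 14), date(2014, 2, 3)]
notified = [date(2005, 2, 10), date(2005, 3, 1), date(2006, 7, 20), date(2007, 5, 9),
            date(2011, 2, 11), date(2012, 8, 1), date(2013, 6, 25), date(2014, 3, 17)]

onset_by_day = proportions(onset, "day")
notified_by_day = proportions(notified, "day")

# Excess share of dates on the 1st of the month relative to the notification baseline
excess_first = {
    e: onset_by_day[e].get(1, 0.0) - notified_by_day[e].get(1, 0.0)
    for e in ("pre-web", "web")
}
print(excess_first)
```

Comparing each date variable against the notification baseline, rather than against a uniform distribution, separates heaping on particular days from any genuine reporting structure shared by all dates.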
We did not involve patients or the public in the design or planning of this study.
We found high completeness for common demographic variables such as sex, age, ethnic group and UK birth status (Supplementary Figure S1, Table 1). More problematically, BCG status and year of BCG vaccination had a high percentage missing, even after accounting for the introduction of national collection of these variables in 2008.[2] Socio-economic status (as national quintiles) was not collected until 2010 but was highly complete after this point.[2] Comparing 2000-2008 with 2009-2015 (Table 1, Supplementary Figure S1), completeness improved over time for most variables, although sputum smear status was a notable exception.[2,6] There was some evidence that groups of variables had correlated missing data (Supplementary Figure S1).
| Variable | 2000-2008: % missing (N) | 2009-2015: % missing (N) |
|---|---|---|
| Socio-economic status (quintiles) | 100.0 (63175) | 15.7 (8120) |
| Year of BCG vaccination | 98.9 (62479) | 60.8 (31421) |
| BCG status | 98.0 (61916) | 33.2 (17133) |
| Date of diagnosis | 72.1 (45557) | 19.9 (10303) |
| Sputum smear status | 52.1 (32912) | 62.1 (32094) |
| Time since entry | 46.0 (29084) | 36.2 (18670) |
| Drug resistance | 43.5 (27485) | 40.7 (20995) |
| Occupation | 39.4 (24870) | 10.7 (5513) |
| Date of symptom onset | 37.9 (23937) | 24.8 (12829) |
| Treatment end date | 29.6 (18711) | 2.2 (1137) |
| Previous diagnosis | 20.9 (13204) | 6.1 (3148) |
| Date of starting treatment | 14.5 (9151) | 4.1 (2127) |
| Cause of death | 11.9 (7539) | 2.3 (1191) |
| UK birth status | 9.9 (6230) | 3.5 (1825) |
| Overall outcome | 9.6 (6044) | 0.0 (0) |
| Started treatment | 6.7 (4242) | 1.2 (602) |
| Ethnic group | 4.4 (2811) | 2.4 (1229) |
| Date of death | 2.0 (1235) | 0.7 (357) |
| Pulmonary or extra-pulmonary TB | 0.3 (177) | 0.4 (213) |
| Sex | 0.2 (101) | 0.2 (110) |
| Public Health England Centre | 0.1 (32) | 0.0 (0) |
| Age | 0.0 (25) | 0.0 (0) |
| Date of notification | 0.0 (0) | 0.0 (0) |
| Year | 0.0 (0) | 0.0 (0) |
| Culture | 0.0 (0) | 0.0 (0) |
By filtering nested variables - rather than by using replacement - we found the date of starting treatment was 5.9% (6434/108410) missing, which is more complete than previously estimated. For cases that were known to have completed treatment 16.5% (13804/83891) were missing a date for the end of treatment. In notifications that were known to have died, 26.6% (1592/5976) were missing the date of death and 44.9% (2686/5976) were missing the cause of death.
There was evidence that drug resistance was missing with an MAR mechanism for all variables considered (Table 2), excepting year of notification. Men were less likely than women to be missing drug resistance status. Children were much more likely to have a missing drug resistance status than any other age group. The White ethnic group were less likely to be missing drug resistance status than most other ethnic groups, with the strongest evidence for the Indian, Pakistani and Bangladeshi groups. The UK born population was more likely to be missing drug resistance status, as were those from higher socio-economic quintiles. Notifications in London were more likely to be missing drug resistance status than those from most other PHE centres.
| Variable | Category | Missing (N) | Notifications (41659) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 40.7% (2905) | 7143 | *Reference* | | 0.844 |
| | 2011 | 40.2% (3126) | 7781 | 0.97 (0.91, 1.04) | 0.428 | |
| | 2012 | 40.1% (3107) | 7755 | 0.96 (0.90, 1.03) | 0.278 | |
| | 2013 | 40.4% (2839) | 7034 | 0.98 (0.92, 1.05) | 0.625 | |
| | 2014 | 39.8% (2519) | 6327 | 0.96 (0.90, 1.03) | 0.29 | |
| | 2015 | 40.3% (2267) | 5619 | 1.00 (0.93, 1.07) | 0.896 | |
| Sex | Female | 43.1% (7613) | 17664 | *Reference* | | 4.81e-21 |
| | Male | 38.1% (9150) | 23995 | 0.82 (0.79, 0.86) | 4.67e-21 | |
| Age | 0-14 | 76.1% (1365) | 1793 | *Reference* | | 2.95e-229 |
| | 15-44 | 36.0% (9096) | 25235 | 0.18 (0.16, 0.21) | 6e-180 | |
| | 45-64 | 43.9% (3961) | 9026 | 0.26 (0.23, 0.29) | 2.07e-105 | |
| | 65+ | 41.8% (2341) | 5605 | 0.23 (0.20, 0.26) | 1.48e-112 | |
| Ethnic group | White | 40.2% (3364) | 8359 | *Reference* | | 8.17e-29 |
| | Black-Caribbean | 40.1% (372) | 928 | 0.99 (0.85, 1.14) | 0.854 | |
| | Black-African | 38.5% (2775) | 7204 | 1.07 (0.98, 1.16) | 0.133 | |
| | Black-Other | 42.5% (157) | 369 | 1.20 (0.96, 1.49) | 0.105 | |
| | Indian | 40.7% (4412) | 10848 | 1.24 (1.15, 1.34) | 1.58e-08 | |
| | Pakistani | 42.4% (2885) | 6806 | 1.31 (1.21, 1.41) | 2.7e-11 | |
| | Bangladeshi | 47.1% (791) | 1680 | 1.67 (1.48, 1.88) | 9.74e-18 | |
| | Chinese | 34.6% (171) | 494 | 0.97 (0.80, 1.18) | 0.787 | |
| | Mixed / Other | 36.9% (1836) | 4971 | 1.01 (0.92, 1.10) | 0.911 | |
| UK birth status | Non-UK Born | 38.6% (11913) | 30880 | *Reference* | | 3.1e-08 |
| | UK Born | 45.0% (4850) | 10779 | 1.19 (1.12, 1.26) | 3e-08 | |
| Socio-economic status | 1 | 40.0% (6454) | 16131 | *Reference* | | 0.000369 |
| | 2 | 39.7% (5005) | 12621 | 1.02 (0.97, 1.07) | 0.487 | |
| | 3 | 40.3% (2633) | 6530 | 1.06 (1.00, 1.13) | 0.0563 | |
| | 4 | 41.1% (1561) | 3796 | 1.10 (1.02, 1.19) | 0.0125 | |
| | 5 | 43.0% (1110) | 2581 | 1.21 (1.10, 1.32) | 4.57e-05 | |
| Public Health England centre | London | 40.4% (7135) | 17658 | *Reference* | | 6.46e-15 |
| | West Midlands | 43.6% (2274) | 5217 | 1.07 (1.00, 1.15) | 0.0416 | |
| | North West | 39.2% (1597) | 4075 | 0.87 (0.81, 0.94) | 0.000464 | |
| | South East | 38.2% (1542) | 4037 | 0.87 (0.81, 0.94) | 0.000287 | |
| | Yorkshire and the Humber | 40.9% (1257) | 3077 | 0.91 (0.84, 0.99) | 0.027 | |
| | East of England | 38.3% (1019) | 2662 | 0.87 (0.80, 0.95) | 0.00171 | |
| | East Midlands | 40.2% (1025) | 2548 | 0.95 (0.87, 1.04) | 0.27 | |
| | South West | 42.3% (674) | 1595 | 1.06 (0.95, 1.18) | 0.3 | |
| | North East | 30.4% (240) | 790 | 0.60 (0.51, 0.70) | 3.43e-10 | |
Similarly to drug resistance, there was evidence that BCG status was missing with an MAR mechanism for all variables considered (Table 3), with stronger evidence for an association with year but reduced evidence of an association with socio-economic status. After adjusting for other variables, data completeness increased from 2010 until 2012 but has since shown no clear trend. Men appeared to be more likely than women to have a missing BCG status, with the non-UK born also being more likely than the UK born to be missing BCG status. The proportion of those missing BCG status increased with age, with those aged 65+ having over 4 times the odds of a missing BCG status compared with those aged 0-14 years old. The White ethnic group was more likely to have a missing BCG status than any other ethnic group. London was associated with reduced missingness for BCG status compared to other PHE centres.
Missingness for year of BCG vaccination had similar associations to BCG status. However, there was less evidence of an association with sex, the White ethnic group were less likely to have a missing status than other ethnic groups, and there was strong evidence of an association with socio-economic status, with those in the lowest quintile being more likely to have a missing year of BCG vaccination. London was much more likely to be missing year of BCG vaccination than any other PHE centre, a reversal of the relationship observed for BCG status.
| Variable | Category | Missing (N) | Notifications (41659) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 31.3% (2235) | 7143 | *Reference* | | 1.6e-08 |
| | 2011 | 29.8% (2319) | 7781 | 0.94 (0.88, 1.01) | 0.107 | |
| | 2012 | 27.9% (2164) | 7755 | 0.85 (0.79, 0.92) | 1.93e-05 | |
| | 2013 | 27.1% (1907) | 7034 | 0.79 (0.73, 0.85) | 1.3e-09 | |
| | 2014 | 30.1% (1907) | 6327 | 0.90 (0.83, 0.97) | 0.00672 | |
| | 2015 | 29.7% (1668) | 5619 | 0.88 (0.81, 0.95) | 0.00104 | |
| Sex | Female | 27.4% (4847) | 17664 | *Reference* | | 5.21e-14 |
| | Male | 30.6% (7353) | 23995 | 1.19 (1.14, 1.24) | 5.97e-14 | |
| Age | 0-14 | 13.1% (235) | 1793 | *Reference* | | 8.49e-162 |
| | 15-44 | 26.0% (6557) | 25235 | 2.24 (1.94, 2.60) | 5.72e-27 | |
| | 45-64 | 32.8% (2964) | 9026 | 3.05 (2.63, 3.55) | 3.38e-47 | |
| | 65+ | 43.6% (2444) | 5605 | 4.82 (4.13, 5.64) | 1.93e-87 | |
| Ethnic group | White | 35.4% (2959) | 8359 | *Reference* | | 1.18e-14 |
| | Black-Caribbean | 24.6% (228) | 928 | 0.88 (0.74, 1.03) | 0.124 | |
| | Black-African | 27.3% (1966) | 7204 | 0.87 (0.79, 0.95) | 0.00235 | |
| | Black-Other | 24.1% (89) | 369 | 0.87 (0.67, 1.12) | 0.275 | |
| | Indian | 25.9% (2805) | 10848 | 0.71 (0.65, 0.77) | 3.69e-16 | |
| | Pakistani | 33.2% (2258) | 6806 | 0.85 (0.78, 0.93) | 0.000209 | |
| | Bangladeshi | 27.9% (469) | 1680 | 0.92 (0.81, 1.05) | 0.214 | |
| | Chinese | 33.6% (166) | 494 | 0.91 (0.74, 1.12) | 0.395 | |
| | Mixed / Other | 25.3% (1260) | 4971 | 0.80 (0.72, 0.88) | 5.15e-06 | |
| UK birth status | Non-UK Born | 29.5% (9104) | 30880 | *Reference* | | 7.78e-28 |
| | UK Born | 28.7% (3096) | 10779 | 0.68 (0.63, 0.73) | 2.69e-27 | |
| Socio-economic status | 1 | 30.7% (4948) | 16131 | *Reference* | | 0.0647 |
| | 2 | 26.8% (3383) | 12621 | 1.01 (0.95, 1.07) | 0.825 | |
| | 3 | 29.2% (1905) | 6530 | 1.09 (1.01, 1.16) | 0.0187 | |
| | 4 | 30.1% (1142) | 3796 | 0.98 (0.90, 1.06) | 0.616 | |
| | 5 | 31.8% (822) | 2581 | 0.96 (0.87, 1.06) | 0.415 | |
| Public Health England centre | London | 21.0% (3716) | 17658 | *Reference* | | 0 |
| | West Midlands | 22.4% (1171) | 5217 | 1.08 (0.99, 1.16) | 0.066 | |
| | North West | 51.8% (2112) | 4075 | 4.16 (3.85, 4.49) | 4.44e-286 | |
| | South East | 26.6% (1074) | 4037 | 1.33 (1.23, 1.45) | 7.73e-12 | |
| | Yorkshire and the Humber | 37.0% (1138) | 3077 | 2.24 (2.05, 2.44) | 1.35e-72 | |
| | East of England | 36.4% (969) | 2662 | 2.12 (1.94, 2.32) | 6.4e-61 | |
| | East Midlands | 45.3% (1154) | 2548 | 3.20 (2.93, 3.50) | 4.07e-145 | |
| | South West | 41.2% (657) | 1595 | 2.55 (2.28, 2.85) | 5.96e-62 | |
| | North East | 26.5% (209) | 790 | 1.31 (1.11, 1.55) | 0.0013 | |
For date of symptom onset there was strong evidence of an MAR mechanism for all variables considered, except for sex (Table 4). The likelihood of date of symptom onset being missing reduced with year of notification. Children (0-14 years old) were more likely to have a missing date of symptom onset than any other age group, as were those in most socio-economic quintiles when compared to the poorest group. UK born cases were more likely to have a complete date of symptom onset than non-UK born cases, with the White ethnic group being more likely to have a missing date of symptom onset than most other ethnic groups. London was again associated with an increased level of missing data compared to other PHE centres.
| Variable | Category | Missing (N) | Notifications (41659) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 34.0% (2426) | 7143 | *Reference* | | 0 |
| | 2011 | 30.1% (2339) | 7781 | 0.84 (0.78, 0.90) | 1.45e-06 | |
| | 2012 | 24.2% (1878) | 7755 | 0.61 (0.57, 0.66) | 1.73e-38 | |
| | 2013 | 17.5% (1233) | 7034 | 0.41 (0.37, 0.44) | 2.6e-105 | |
| | 2014 | 11.8% (744) | 6327 | 0.25 (0.23, 0.27) | 6.1e-187 | |
| | 2015 | 6.9% (390) | 5619 | 0.14 (0.12, 0.15) | 1.7e-245 | |
| Sex | Female | 22.0% (3894) | 17664 | *Reference* | | 0.363 |
| | Male | 21.3% (5116) | 23995 | 0.98 (0.93, 1.03) | 0.363 | |
| Age | 0-14 | 38.1% (684) | 1793 | *Reference* | | 6.9e-78 |
| | 15-44 | 20.5% (5182) | 25235 | 0.33 (0.30, 0.38) | 4.33e-78 | |
| | 45-64 | 20.7% (1870) | 9026 | 0.36 (0.32, 0.41) | 4.15e-58 | |
| | 65+ | 22.7% (1274) | 5605 | 0.44 (0.39, 0.51) | 3.41e-34 | |
| Ethnic group | White | 20.9% (1749) | 8359 | *Reference* | | 1.53e-08 |
| | Black-Caribbean | 23.1% (214) | 928 | 0.76 (0.64, 0.90) | 0.00216 | |
| | Black-African | 23.0% (1654) | 7204 | 0.72 (0.65, 0.79) | 7.47e-11 | |
| | Black-Other | 18.7% (69) | 369 | 0.61 (0.45, 0.80) | 0.000611 | |
| | Indian | 22.2% (2404) | 10848 | 0.76 (0.70, 0.84) | 1.17e-08 | |
| | Pakistani | 19.2% (1305) | 6806 | 0.79 (0.72, 0.87) | 3.23e-06 | |
| | Bangladeshi | 23.9% (401) | 1680 | 0.80 (0.69, 0.92) | 0.00178 | |
| | Chinese | 18.8% (93) | 494 | 0.68 (0.53, 0.87) | 0.0025 | |
| | Mixed / Other | 22.6% (1121) | 4971 | 0.79 (0.71, 0.88) | 1.07e-05 | |
| UK birth status | Non-UK Born | 21.9% (6774) | 30880 | *Reference* | | 0.000152 |
| | UK Born | 20.7% (2236) | 10779 | 0.86 (0.80, 0.93) | 0.00016 | |
| Socio-economic status | 1 | 19.9% (3218) | 16131 | *Reference* | | 1.06e-06 |
| | 2 | 22.9% (2888) | 12621 | 0.98 (0.92, 1.05) | 0.63 | |
| | 3 | 24.2% (1578) | 6530 | 1.17 (1.08, 1.26) | 7.32e-05 | |
| | 4 | 22.0% (837) | 3796 | 1.18 (1.07, 1.29) | 0.000845 | |
| | 5 | 18.9% (489) | 2581 | 1.17 (1.04, 1.31) | 0.01 | |
| Public Health England centre | London | 30.0% (5289) | 17658 | *Reference* | | 0 |
| | West Midlands | 12.0% (627) | 5217 | 0.30 (0.27, 0.33) | 8.63e-137 | |
| | North West | 20.6% (841) | 4075 | 0.56 (0.51, 0.61) | 5.62e-36 | |
| | South East | 9.0% (363) | 4037 | 0.20 (0.18, 0.23) | 7.15e-156 | |
| | Yorkshire and the Humber | 13.2% (407) | 3077 | 0.32 (0.28, 0.35) | 4.19e-83 | |
| | East of England | 26.5% (705) | 2662 | 0.80 (0.72, 0.88) | 4.54e-06 | |
| | East Midlands | 19.2% (488) | 2548 | 0.52 (0.47, 0.58) | 2.21e-32 | |
| | South West | 10.9% (174) | 1595 | 0.27 (0.23, 0.32) | 6.79e-53 | |
| | North East | 14.7% (116) | 790 | 0.39 (0.31, 0.47) | 1.9e-19 | |
For date of diagnosis there was again strong evidence for an MAR mechanism for all variables considered, except for sex (Supplementary Table S1). Completeness increased with year of notification, as seen previously, and there was an increased likelihood of missing data in males and the non-UK born. The White ethnic group was more likely to be missing data on the date of diagnosis than the majority of other ethnic groups. The poorest socio-economic group was less likely to be missing data than all other socio-economic quintiles. Children (0-14 years old) were again more likely to be missing data than adults in any age group. As for other variables, London had a much higher proportion of missing data than any other PHE centre.
For date of starting treatment there was evidence that missing data were again associated with all variables considered, excepting UK birth status and socio-economic status (Supplementary Table S2). Missingness for the date of ending treatment was associated with fewer variables, with evidence only of associations with year and PHE centre (Supplementary Table S3). For both variables the proportion of missing data reduced with the year of notification. London had a lower proportion of missing data when compared to most other PHE centres. For the date of starting treatment the White ethnic group were more likely to be missing data than other groups. Older age groups were also more likely to be missing data, as were males.
For date of death there was little evidence of any association, except for PHE centre (Supplementary Table S4). This was also the case for cause of death but there was some additional evidence of an association with ethnic group (Supplementary Table S5). There was little evidence of a clear trend across ethnic groups for cause of death. As for other outcomes London was much more likely to be missing date of death than other PHE centres. This relationship was reversed for cause of death. Both date of death and cause of death had a small sample size and this may mean that these analyses were underpowered.
Notifications showed evidence of a strong seasonal trend, with a peak in the number of notifications in May-July each year, but had a near uniform distribution within each month (Supplementary Table S6). There was little evidence of strong biases in this reporting and little evidence to suggest that the introduction of the web-based ETS impacted the distribution of notifications or the levels of bias. The date of symptom onset showed evidence of an inverted seasonal trend in comparison to notifications (Table 5). There was evidence that reporting in January may be biased, with a much greater proportion of cases reported as having symptoms starting in this month than in any other. There was also evidence that cases were more likely to have symptoms start on the first and the 14th of each month, again indicating bias. Both of these apparent biases were reduced by the introduction of the web-based ETS but were still present. The date of ending treatment also showed some evidence of these biases and had the same inverted seasonal trend as the date of symptom onset (Supplementary Table S7). The date of diagnosis, date of starting treatment and date of death showed a similar reporting structure to notifications, although the strength of the seasonal trend was reduced (see the supplementary information). There was little clear evidence of biases in reporting, either by month or by day, for these variables.
Figure 1: a.) Shows the proportion of cases with symptoms starting in a given month for each year, with some evidence of bias in January and reduced evidence of a seasonal trend. b.) Shows the proportion of cases with symptoms starting on a given day for each month, with strong evidence of biased reporting on the first and the 14th of the month. Stratifying both figures by the introduction of the web-based ETS indicates that the web-based ETS may have reduced these biases.
In the ETS system we found a high degree of missing data for several important variables. We also found that there is likely to be a strong missing at random (MAR) mechanism underlying these missing data for multiple variables. Several factors are strongly associated with data being missing for many variables, including UK birth status, ethnic group, socio-economic status and year. These MAR mechanisms must be adjusted for in studies using these data to avoid introducing bias. We found that date variables in particular suffered from changing data completeness over time, which may introduce spurious temporal trends if not fully understood.
The following analysis is not currently in the paper but it was in the chapter - is there a case for including?
We also found that for several variables, including the date of symptom onset, there was a large degree of recall bias when aggregating by day or month. Several variables, including date of notification and date of starting treatment, showed a seasonal trend with a maximum in the summer months. The date of ending treatment showed less evidence of a seasonal trend.
Work in progress - copied from chapter text
Routine observational datasets are subject to numerous potential biases, such as selection bias, recall bias, measurement bias, and unmeasured confounding.[7] Additionally, as the data have not been collected with a specific analysis in mind, there may be issues with the specificity of variables. The ETS system is likely to suffer from all of the above biases to some extent, which must be accounted for as far as possible and explicitly stated at every level of analysis. The most important consideration is that the ETS system is unlikely to be representative of the general population, as it contains only notified TB cases that occurred in England during the study period. Research questions must therefore either be limited to active TB patients or, when extended to the general population, account for the differing population demographics. If this is not done then any results may be due to selection bias. Additionally, multiple variables may suffer from misclassification bias, including BCG status, which can be assessed via vaccination record, the presence of a scar, or case recall: this may lead to spurious associations.[8] Validation studies would be required to account for this.
Unlike classic approaches to missing data, such as multiple imputation by chained equations (MICE),[9] this is not an imputation
Acknowledgements
The authors thank the TB section at Public Health England (PHE) for maintaining the Enhanced Tuberculosis Surveillance (ETS) system, and all the healthcare workers involved in data collection for the ETS.
Contributors
SA conceived and designed the work. SA undertook the analysis with advice from all other authors. All authors contributed to the interpretation of the data. SA wrote the first draft of the paper and all authors contributed to subsequent drafts. All authors approve the work for publication and agree to be accountable for the work.
Funding
SEA, HC, and EBP are funded by the National Institute for Health Research Health Protection Research Unit (NIHR HPRU) in Evaluation of Interventions at University of Bristol in partnership with Public Health England (PHE). The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR, the Department of Health or Public Health England.
Conflicts of interest
HC reports receiving honoraria from Sanofi Pasteur, and consultancy fees from AstraZeneca, GSK and IMS Health, all paid to her employer.
Accessibility of programming code
The code for the analysis contained in this paper can be found at: https://doi.org/10.5281/zenodo.3492200
1 Sterne JAC, White IR, Carlin JB et al. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ 2009;338:b2393.
2 Public Health England. Tuberculosis in England: 2017 report (presenting data to end of 2016). 2017.
3 Kruijshaar M, French C, Anderson C et al. Tuberculosis in the UK, Annual report on tuberculosis surveillance and control in the UK 2007. Thorax 2007;50:703–3.
4 Pillaye J, Clarke A. An evaluation of completeness of tuberculosis notification in the United Kingdom. BMC Public Health 2003;3:31.
5 Sterne JA, Davey Smith G. Sifting the evidence-what’s wrong with significance tests? BMJ 2001;322:226–31.
6 Public Health England. Tuberculosis in England: 2016 report (presenting data to end of 2015). 2016.
7 Benchimol EI, Smeeth L, Guttmann A et al. The REporting of studies Conducted using Observational Routinely-collected health Data (RECORD) Statement. PLoS Medicine 2015;12:e1001885.
8 Fewell Z, Davey Smith G, Sterne JAC. The impact of residual and unmeasured confounding in epidemiologic studies: A simulation study. American Journal of Epidemiology 2007;166:646–55.
9 van Buuren S, Groothuis-Oudshoorn K. mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 2011;45(3):1–67.
Sam Abbott, Hannah Christensen, Ellen Brooks-Pollock
Supplementary Figure S1: Summary plot of missing data in the extract of the ETS data used in this study. Due to the large size of the dataset, the data have been sub-sampled, with only 20% of the data shown in this figure. Notifications have been ordered by date of notification from left to right. The following subset of variables are shown: year (year), sex (sex), age (age), PHE Centre (phec), occupation (occat), ethnic group (ethgrp), UK birth status (ukborn), time since entry (timesinceent), date of symptom onset (symptonset), date of diagnosis (datediag), started treatment (startedtreat), date of starting treatment (starttreatdate), treatment end date (txenddate), pulmonary or extra-pulmonary TB (pulmextrapulm), culture (culture), sputum smear status (sputsmear), drug resistance (anyres), previous diagnosis (prevdiag), BCG status (bcgvacc), year of BCG vaccination (bcgvaccyr), overall outcome (overalloutcome), cause of death (tomdeathrelate), socio-economic status quintiles (natquintile), and date of death (dateofdeath). Nested variables have been accounted for (i.e. date of death has had an entry added for cases that are known to have not died).

| Variable | Category | Missing, % (N) | Notifications (20835) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 61.0% (2090) | 3424 | *Reference* | | 1.59e-09 |
| | 2011 | 59.6% (2304) | 3869 | 0.90 (0.79, 1.03) | 0.134 | |
| | 2012 | 56.2% (2216) | 3945 | 0.73 (0.64, 0.84) | 6.21e-06 | |
| | 2013 | 55.7% (2025) | 3638 | 0.75 (0.65, 0.86) | 2.71e-05 | |
| | 2014 | 56.6% (1776) | 3138 | 0.83 (0.72, 0.95) | 0.00891 | |
| | 2015 | 54.2% (1530) | 2821 | 0.64 (0.55, 0.74) | 1.34e-09 | |
| Sex | Female | 55.5% (5089) | 9174 | *Reference* | | 0.275 |
| | Male | 58.8% (6852) | 11661 | 1.05 (0.97, 1.13) | 0.275 | |
| Age | 0-14 | 43.9% (488) | 1111 | *Reference* | | 1.21e-20 |
| | 15-44 | 58.3% (8216) | 14102 | 2.12 (1.77, 2.53) | 1.38e-16 | |
| | 45-64 | 57.6% (2526) | 4388 | 2.42 (1.99, 2.94) | 6.72e-19 | |
| | 65+ | 57.6% (711) | 1234 | 3.00 (2.36, 3.83) | 5.09e-19 | |
| Ethnic group | White | 44.2% (1370) | 3102 | *Reference* | | 5.86e-12 |
| | Black-Caribbean | 77.5% (371) | 479 | 1.19 (0.89, 1.61) | 0.242 | |
| | Black-African | 65.2% (2524) | 3870 | 0.91 (0.78, 1.07) | 0.261 | |
| | Black-Other | 72.0% (154) | 214 | 1.23 (0.80, 1.90) | 0.349 | |
| | Indian | 56.1% (3516) | 6267 | 0.75 (0.65, 0.86) | 7.27e-05 | |
| | Pakistani | 51.6% (1583) | 3066 | 1.10 (0.95, 1.28) | 0.205 | |
| | Bangladeshi | 73.1% (583) | 797 | 1.48 (1.15, 1.90) | 0.00226 | |
| | Chinese | 58.2% (142) | 244 | 1.23 (0.83, 1.80) | 0.3 | |
| | Mixed / Other | 60.7% (1698) | 2796 | 0.83 (0.70, 0.98) | 0.0318 | |
| UK birth status | Non-UK Born | 61.1% (9665) | 15808 | *Reference* | | 5.14e-08 |
| | UK Born | 45.3% (2276) | 5027 | 0.74 (0.66, 0.82) | 4.98e-08 | |
| Socio-economic status | 1 | 55.4% (4221) | 7615 | *Reference* | | 4.64e-05 |
| | 2 | 66.3% (4463) | 6729 | 0.88 (0.79, 0.97) | 0.0118 | |
| | 3 | 59.4% (2019) | 3401 | 0.84 (0.74, 0.95) | 0.00684 | |
| | 4 | 45.3% (838) | 1848 | 0.70 (0.60, 0.82) | 6.29e-06 | |
| | 5 | 32.2% (400) | 1242 | 0.78 (0.65, 0.93) | 0.00583 | |
| Public Health England centre | London | 91.0% (9421) | 10358 | *Reference* | | 0 |
| | West Midlands | 39.3% (1010) | 2568 | 0.06 (0.05, 0.07) | 0 | |
| | North West | 9.2% (116) | 1260 | 0.01 (0.01, 0.01) | 0 | |
| | South East | 13.0% (293) | 2261 | 0.01 (0.01, 0.02) | 0 | |
| | Yorkshire and the Humber | 45.2% (528) | 1167 | 0.08 (0.07, 0.09) | 2.85e-255 | |
| | East of England | 19.9% (260) | 1305 | 0.02 (0.02, 0.03) | 0 | |
| | East Midlands | 3.1% (33) | 1066 | 0.00 (0.00, 0.00) | 2.6e-224 | |
| | South West | 38.4% (175) | 456 | 0.06 (0.05, 0.08) | 4.24e-153 | |
| | North East | 26.6% (105) | 394 | 0.03 (0.03, 0.04) | 2.87e-172 | |
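
The counts in the table above are enough to compute a crude (unadjusted) odds ratio of missingness for any category against its reference. The sketch below does this for the Sex rows using a standard Wald interval on the log odds ratio. The function name `crude_odds_ratio` and the 95% level (`z = 1.96`) are assumptions made for illustration. The crude estimate need not match the tabulated odds ratio, which appears to be adjusted for the other listed variables.

```python
import math


def crude_odds_ratio(missing_a, total_a, missing_ref, total_ref, z=1.96):
    """Unadjusted odds ratio of missingness (group vs reference) with a Wald CI."""
    a, b = missing_a, total_a - missing_a          # group: missing, observed
    c, d = missing_ref, total_ref - missing_ref    # reference: missing, observed
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts of the 2x2 table.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Sex rows of the table: Male 6852 missing of 11661; Female (reference) 5089 of 9174.
or_, lo, hi = crude_odds_ratio(6852, 11661, 5089, 9174)
print(f"{or_:.2f} ({lo:.2f}, {hi:.2f})")
```

Comparing a crude ratio of this kind with the tabulated value gives a rough sense of how much the other variables in the model account for the raw difference in missingness.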

| Variable | Category | Missing, % (N) | Notifications (41659) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 26.9% (1918) | 7143 | *Reference* | | 7.54e-286 |
| | 2011 | 22.3% (1736) | 7781 | 0.77 (0.71, 0.83) | 2.11e-10 | |
| | 2012 | 18.8% (1458) | 7755 | 0.61 (0.56, 0.66) | 3.93e-31 | |
| | 2013 | 12.9% (909) | 7034 | 0.38 (0.35, 0.42) | 6.81e-91 | |
| | 2014 | 10.4% (659) | 6327 | 0.30 (0.27, 0.33) | 6.2e-120 | |
| | 2015 | 7.4% (415) | 5619 | 0.20 (0.18, 0.22) | 1.56e-158 | |
| Sex | Female | 16.9% (2984) | 17664 | *Reference* | | 0.432 |
| | Male | 17.1% (4111) | 23995 | 1.02 (0.97, 1.08) | 0.432 | |
| Age | 0-14 | 19.4% (348) | 1793 | *Reference* | | 0.000251 |
| | 15-44 | 17.8% (4504) | 25235 | 0.74 (0.65, 0.86) | 4.77e-05 | |
| | 45-64 | 15.9% (1434) | 9026 | 0.73 (0.62, 0.85) | 3.52e-05 | |
| | 65+ | 14.4% (809) | 5605 | 0.79 (0.68, 0.94) | 0.00563 | |
| Ethnic group | White | 12.5% (1043) | 8359 | *Reference* | | 6.85e-08 |
| | Black-Caribbean | 25.2% (234) | 928 | 1.20 (1.00, 1.43) | 0.0469 | |
| | Black-African | 21.9% (1577) | 7204 | 0.99 (0.89, 1.11) | 0.876 | |
| | Black-Other | 17.9% (66) | 369 | 0.75 (0.56, 1.01) | 0.0612 | |
| | Indian | 18.0% (1957) | 10848 | 0.80 (0.72, 0.89) | 4.94e-05 | |
| | Pakistani | 11.8% (805) | 6806 | 0.86 (0.76, 0.97) | 0.0158 | |
| | Bangladeshi | 21.5% (361) | 1680 | 0.94 (0.81, 1.10) | 0.469 | |
| | Chinese | 13.4% (66) | 494 | 0.66 (0.49, 0.88) | 0.00525 | |
| | Mixed / Other | 19.8% (986) | 4971 | 0.91 (0.81, 1.02) | 0.117 | |
| UK birth status | Non-UK Born | 18.4% (5696) | 30880 | *Reference* | | 0.00227 |
| | UK Born | 13.0% (1399) | 10779 | 0.87 (0.80, 0.95) | 0.00235 | |
| Socio-economic status | 1 | 14.4% (2317) | 16131 | *Reference* | | 6.01e-14 |
| | 2 | 19.6% (2469) | 12621 | 0.97 (0.90, 1.04) | 0.394 | |
| | 3 | 20.3% (1325) | 6530 | 1.22 (1.12, 1.33) | 5.3e-06 | |
| | 4 | 17.0% (645) | 3796 | 1.30 (1.17, 1.45) | 1.87e-06 | |
| | 5 | 13.1% (339) | 2581 | 1.42 (1.23, 1.62) | 9.74e-07 | |
| Public Health England centre | London | 31.0% (5471) | 17658 | *Reference* | | 0 |
| | West Midlands | 3.6% (190) | 5217 | 0.08 (0.07, 0.10) | 4.97e-226 | |
| | North West | 7.6% (308) | 4075 | 0.18 (0.15, 0.20) | 6.15e-159 | |
| | South East | 3.9% (157) | 4037 | 0.08 (0.07, 0.09) | 4e-193 | |
| | Yorkshire and the Humber | 3.2% (99) | 3077 | 0.07 (0.06, 0.09) | 1.51e-137 | |
| | East of England | 11.3% (302) | 2662 | 0.26 (0.23, 0.30) | 2.32e-93 | |
| | East Midlands | 18.9% (482) | 2548 | 0.51 (0.46, 0.57) | 2.4e-33 | |
| | South West | 2.8% (45) | 1595 | 0.06 (0.05, 0.08) | 8.96e-73 | |
| | North East | 5.2% (41) | 790 | 0.12 (0.09, 0.17) | 5.45e-38 | |

| Variable | Category | Missing, % (N) | Notifications (41659) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 5.1% (367) | 7143 | *Reference* | | 2.48e-37 |
| | 2011 | 4.7% (368) | 7781 | 0.92 (0.79, 1.07) | 0.281 | |
| | 2012 | 4.0% (314) | 7755 | 0.77 (0.66, 0.90) | 0.00121 | |
| | 2013 | 3.8% (265) | 7034 | 0.70 (0.59, 0.82) | 1.7e-05 | |
| | 2014 | 2.2% (139) | 6327 | 0.39 (0.32, 0.47) | 1.36e-20 | |
| | 2015 | 2.0% (115) | 5619 | 0.36 (0.29, 0.45) | 1.65e-20 | |
| Sex | Female | 3.4% (608) | 17664 | *Reference* | | 0.00223 |
| | Male | 4.0% (960) | 23995 | 1.18 (1.06, 1.31) | 0.00234 | |
| Age | 0-14 | 3.6% (64) | 1793 | *Reference* | | 1.89e-29 |
| | 15-44 | 3.1% (774) | 25235 | 0.89 (0.68, 1.17) | 0.384 | |
| | 45-64 | 3.4% (310) | 9026 | 0.93 (0.70, 1.25) | 0.628 | |
| | 65+ | 7.5% (420) | 5605 | 1.96 (1.49, 2.63) | 3.16e-06 | |
| Ethnic group | White | 5.8% (486) | 8359 | *Reference* | | 0.00077 |
| | Black-Caribbean | 3.4% (32) | 928 | 0.71 (0.48, 1.02) | 0.0765 | |
| | Black-African | 2.8% (203) | 7204 | 0.61 (0.49, 0.76) | 7.46e-06 | |
| | Black-Other | 3.3% (12) | 369 | 0.79 (0.42, 1.38) | 0.445 | |
| | Indian | 3.4% (371) | 10848 | 0.71 (0.59, 0.86) | 0.000401 | |
| | Pakistani | 3.6% (243) | 6806 | 0.63 (0.52, 0.77) | 4.66e-06 | |
| | Bangladeshi | 3.1% (52) | 1680 | 0.66 (0.48, 0.90) | 0.0108 | |
| | Chinese | 3.8% (19) | 494 | 0.78 (0.46, 1.24) | 0.318 | |
| | Mixed / Other | 3.0% (150) | 4971 | 0.70 (0.55, 0.87) | 0.00173 | |
| UK birth status | Non-UK Born | 3.4% (1045) | 30880 | *Reference* | | 0.516 |
| | UK Born | 4.9% (523) | 10779 | 0.95 (0.81, 1.11) | 0.516 | |
| Socio-economic status | 1 | 3.8% (611) | 16131 | *Reference* | | 0.665 |
| | 2 | 3.7% (462) | 12621 | 1.05 (0.92, 1.20) | 0.481 | |
| | 3 | 3.5% (226) | 6530 | 0.92 (0.78, 1.09) | 0.336 | |
| | 4 | 4.1% (154) | 3796 | 0.99 (0.82, 1.20) | 0.934 | |
| | 5 | 4.5% (115) | 2581 | 1.01 (0.81, 1.25) | 0.925 | |
| Public Health England centre | London | 3.1% (551) | 17658 | *Reference* | | 2.84e-17 |
| | West Midlands | 3.8% (198) | 5217 | 1.11 (0.93, 1.32) | 0.229 | |
| | North West | 4.3% (176) | 4075 | 1.27 (1.05, 1.52) | 0.0112 | |
| | South East | 3.0% (121) | 4037 | 0.87 (0.71, 1.07) | 0.194 | |
| | Yorkshire and the Humber | 6.6% (202) | 3077 | 2.03 (1.70, 2.43) | 8.5e-15 | |
| | East of England | 3.3% (88) | 2662 | 0.97 (0.77, 1.22) | 0.815 | |
| | East Midlands | 3.2% (82) | 2548 | 0.93 (0.73, 1.17) | 0.542 | |
| | South West | 6.9% (110) | 1595 | 1.94 (1.54, 2.41) | 5.75e-09 | |
| | North East | 5.1% (40) | 790 | 1.44 (1.01, 1.99) | 0.0342 | |

| Variable | Category | Missing, % (N) | Notifications (33606) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 2.9% (182) | 6171 | *Reference* | | 4.89e-15 |
| | 2011 | 2.6% (177) | 6855 | 0.88 (0.71, 1.08) | 0.228 | |
| | 2012 | 2.4% (164) | 6882 | 0.78 (0.63, 0.97) | 0.0274 | |
| | 2013 | 1.5% (97) | 6298 | 0.49 (0.38, 0.63) | 3.05e-08 | |
| | 2014 | 1.2% (66) | 5341 | 0.38 (0.29, 0.51) | 5.33e-11 | |
| | 2015 | 1.4% (28) | 2059 | 0.47 (0.31, 0.69) | 0.000223 | |
| Sex | Female | 2.1% (311) | 14630 | *Reference* | | 0.506 |
| | Male | 2.1% (403) | 18976 | 1.05 (0.91, 1.23) | 0.507 | |
| Age | 0-14 | 2.7% (44) | 1617 | *Reference* | | 0.52 |
| | 15-44 | 2.0% (419) | 21027 | 0.81 (0.59, 1.14) | 0.209 | |
| | 45-64 | 2.3% (165) | 7272 | 0.83 (0.59, 1.20) | 0.314 | |
| | 65+ | 2.3% (86) | 3690 | 0.74 (0.50, 1.11) | 0.141 | |
| Ethnic group | White | 2.9% (176) | 6076 | *Reference* | | 0.0466 |
| | Black-Caribbean | 2.8% (21) | 753 | 1.51 (0.91, 2.38) | 0.0888 | |
| | Black-African | 1.9% (114) | 6071 | 0.90 (0.66, 1.23) | 0.512 | |
| | Black-Other | 2.3% (7) | 306 | 1.34 (0.56, 2.75) | 0.464 | |
| | Indian | 1.7% (150) | 8842 | 0.72 (0.55, 0.96) | 0.0235 | |
| | Pakistani | 2.5% (140) | 5668 | 0.86 (0.65, 1.13) | 0.282 | |
| | Bangladeshi | 1.3% (18) | 1409 | 0.65 (0.37, 1.07) | 0.105 | |
| | Chinese | 2.8% (11) | 396 | 1.17 (0.58, 2.14) | 0.643 | |
| | Mixed / Other | 1.9% (77) | 4085 | 0.98 (0.70, 1.35) | 0.887 | |
| UK birth status | Non-UK Born | 1.9% (480) | 25174 | *Reference* | | 0.959 |
| | UK Born | 2.8% (234) | 8432 | 1.01 (0.81, 1.25) | 0.959 | |
| Socio-economic status | 1 | 2.4% (308) | 13080 | *Reference* | | 0.257 |
| | 2 | 1.7% (170) | 10266 | 1.03 (0.84, 1.26) | 0.752 | |
| | 3 | 1.9% (100) | 5265 | 1.09 (0.85, 1.38) | 0.498 | |
| | 4 | 2.8% (84) | 2994 | 1.36 (1.04, 1.76) | 0.021 | |
| | 5 | 2.6% (52) | 2001 | 1.08 (0.78, 1.47) | 0.619 | |
| Public Health England centre | London | 0.7% (100) | 14747 | *Reference* | | 8.46e-59 |
| | West Midlands | 4.2% (177) | 4240 | 6.68 (5.16, 8.69) | 2e-46 | |
| | North West | 2.7% (88) | 3208 | 4.16 (3.07, 5.63) | 2.21e-20 | |
| | South East | 2.5% (79) | 3213 | 3.57 (2.62, 4.84) | 3.41e-16 | |
| | Yorkshire and the Humber | 2.8% (67) | 2361 | 4.34 (3.12, 6.01) | 1.06e-18 | |
| | East of England | 4.0% (83) | 2098 | 5.88 (4.35, 7.94) | 6.86e-31 | |
| | East Midlands | 3.1% (63) | 2039 | 4.77 (3.44, 6.58) | 2.87e-21 | |
| | South West | 2.9% (32) | 1122 | 4.22 (2.76, 6.29) | 6.37e-12 | |
| | North East | 4.3% (25) | 578 | 6.73 (4.19, 10.44) | 2.16e-16 | |

| Variable | Category | Missing, % (N) | Notifications (1883) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 16.6% (53) | 320 | *Reference* | | 0.129 |
| | 2011 | 15.9% (52) | 327 | 1.02 (0.63, 1.65) | 0.938 | |
| | 2012 | 14.5% (51) | 351 | 0.88 (0.54, 1.42) | 0.593 | |
| | 2013 | 13.5% (42) | 312 | 0.70 (0.43, 1.16) | 0.169 | |
| | 2014 | 9.5% (30) | 317 | 0.55 (0.32, 0.93) | 0.0263 | |
| | 2015 | 13.3% (34) | 256 | 0.67 (0.39, 1.14) | 0.14 | |
| Sex | Female | 14.8% (97) | 657 | *Reference* | | 0.569 |
| | Male | 13.5% (165) | 1226 | 0.91 (0.67, 1.25) | 0.568 | |
| Age | 0-14 | 10.0% (1) | 10 | *Reference* | | 0.799 |
| | 15-44 | 15.7% (31) | 198 | 1.86 (0.26, 38.77) | 0.596 | |
| | 45-64 | 14.6% (68) | 465 | 1.85 (0.26, 38.20) | 0.598 | |
| | 65+ | 13.4% (162) | 1210 | 2.11 (0.30, 43.43) | 0.521 | |
| Ethnic group | White | 11.1% (102) | 920 | *Reference* | | 0.9 |
| | Black-Caribbean | 21.7% (10) | 46 | 0.90 (0.35, 2.18) | 0.817 | |
| | Black-African | 20.1% (27) | 134 | 0.92 (0.45, 1.92) | 0.833 | |
| | Black-Other | 20.0% (1) | 5 | 0.52 (0.03, 4.31) | 0.586 | |
| | Indian | 17.4% (64) | 367 | 0.90 (0.49, 1.70) | 0.747 | |
| | Pakistani | 8.0% (20) | 249 | 0.62 (0.30, 1.29) | 0.204 | |
| | Bangladeshi | 22.7% (10) | 44 | 0.85 (0.33, 2.12) | 0.731 | |
| | Chinese | 14.3% (3) | 21 | 0.80 (0.16, 3.23) | 0.772 | |
| | Mixed / Other | 25.8% (25) | 97 | 1.15 (0.55, 2.39) | 0.711 | |
| UK birth status | Non-UK Born | 16.6% (167) | 1004 | *Reference* | | 0.796 |
| | UK Born | 10.8% (95) | 879 | 1.08 (0.61, 1.92) | 0.796 | |
| Socio-economic status | 1 | 11.4% (79) | 695 | *Reference* | | 0.912 |
| | 2 | 18.3% (86) | 470 | 0.87 (0.59, 1.29) | 0.499 | |
| | 3 | 16.2% (48) | 296 | 1.04 (0.66, 1.64) | 0.87 | |
| | 4 | 12.7% (30) | 237 | 1.02 (0.60, 1.71) | 0.937 | |
| | 5 | 10.3% (19) | 185 | 0.87 (0.46, 1.59) | 0.651 | |
| Public Health England centre | London | 37.6% (201) | 534 | *Reference* | | 1.92e-57 |
| | West Midlands | 2.3% (7) | 305 | 0.04 (0.02, 0.07) | 1.61e-16 | |
| | North West | 7.0% (16) | 228 | 0.12 (0.07, 0.21) | 5.23e-13 | |
| | South East | 4.8% (10) | 208 | 0.08 (0.04, 0.15) | 2.25e-13 | |
| | Yorkshire and the Humber | 3.6% (6) | 168 | 0.06 (0.02, 0.12) | 4.81e-11 | |
| | East of England | 8.5% (11) | 130 | 0.14 (0.07, 0.26) | 6.32e-09 | |
| | East Midlands | 1.9% (3) | 156 | 0.03 (0.01, 0.08) | 2.58e-09 | |
| | South West | 6.7% (7) | 105 | 0.11 (0.04, 0.23) | 7.77e-08 | |
| | North East | 2.0% (1) | 49 | 0.03 (0.00, 0.15) | 0.000694 | |

| Variable | Category | Missing, % (N) | Notifications (1883) | Odds Ratio | P value (Wald) | P value (LRT) |
|---|---|---|---|---|---|---|
| Year | 2010 | 45.0% (144) | 320 | *Reference* | | 0.576 |
| | 2011 | 45.6% (149) | 327 | 0.99 (0.71, 1.37) | 0.944 | |
| | 2012 | 45.3% (159) | 351 | 0.94 (0.68, 1.29) | 0.694 | |
| | 2013 | 43.9% (137) | 312 | 0.94 (0.67, 1.30) | 0.693 | |
| | 2014 | 44.8% (142) | 317 | 0.86 (0.62, 1.20) | 0.379 | |
| | 2015 | 38.7% (99) | 256 | 0.74 (0.52, 1.05) | 0.0933 | |
| Sex | Female | 44.7% (294) | 657 | *Reference* | | 0.763 |
| | Male | 43.7% (536) | 1226 | 0.97 (0.79, 1.19) | 0.763 | |
| Age | 0-14 | 50.0% (5) | 10 | *Reference* | | 0.14 |
| | 15-44 | 35.4% (70) | 198 | 0.69 (0.17, 2.82) | 0.6 | |
| | 45-64 | 43.0% (200) | 465 | 1.02 (0.25, 4.11) | 0.977 | |
| | 65+ | 45.9% (555) | 1210 | 1.03 (0.25, 4.13) | 0.965 | |
| Ethnic group | White | 48.2% (443) | 920 | *Reference* | | 0.00768 |
| | Black-Caribbean | 21.7% (10) | 46 | 0.47 (0.20, 0.99) | 0.0565 | |
| | Black-African | 45.5% (61) | 134 | 1.78 (1.04, 3.03) | 0.0347 | |
| | Black-Other | 20.0% (1) | 5 | 0.70 (0.03, 5.37) | 0.761 | |
| | Indian | 35.7% (131) | 367 | 0.87 (0.56, 1.36) | 0.545 | |
| | Pakistani | 49.4% (123) | 249 | 1.33 (0.84, 2.11) | 0.224 | |
| | Bangladeshi | 27.3% (12) | 44 | 0.82 (0.36, 1.78) | 0.625 | |
| | Chinese | 52.4% (11) | 21 | 1.70 (0.64, 4.55) | 0.284 | |
| | Mixed / Other | 39.2% (38) | 97 | 1.37 (0.78, 2.41) | 0.275 | |
| UK birth status | Non-UK Born | 40.1% (403) | 1004 | *Reference* | | 0.426 |
| | UK Born | 48.6% (427) | 879 | 1.17 (0.79, 1.74) | 0.427 | |
| Socio-economic status | 1 | 43.7% (304) | 695 | *Reference* | | 0.168 |
| | 2 | 40.0% (188) | 470 | 1.26 (0.97, 1.64) | 0.0842 | |
| | 3 | 42.9% (127) | 296 | 1.20 (0.89, 1.63) | 0.235 | |
| | 4 | 49.8% (118) | 237 | 1.43 (1.03, 1.98) | 0.0322 | |
| | 5 | 50.3% (93) | 185 | 1.37 (0.96, 1.97) | 0.0841 | |
| Public Health England centre | London | 25.3% (135) | 534 | *Reference* | | 1.1e-20 |
| | West Midlands | 48.9% (149) | 305 | 3.01 (2.19, 4.14) | 1.17e-11 | |
| | North West | 61.8% (141) | 228 | 4.82 (3.39, 6.91) | 5.11e-18 | |
| | South East | 46.6% (97) | 208 | 2.36 (1.65, 3.37) | 2.23e-06 | |
| | Yorkshire and the Humber | 44.0% (74) | 168 | 2.23 (1.52, 3.26) | 3.55e-05 | |
| | East of England | 46.2% (60) | 130 | 2.36 (1.56, 3.55) | 4e-05 | |
| | East Midlands | 60.3% (94) | 156 | 4.56 (3.09, 6.77) | 3.07e-14 | |
| | South West | 53.3% (56) | 105 | 3.09 (1.97, 4.88) | 1.06e-06 | |
| | North East | 49.0% (24) | 49 | 2.84 (1.54, 5.25) | 0.000831 | |

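
The two p-value columns in these tables can be read as follows, assuming each table summarises a logistic regression of a missingness indicator on the listed variables (the model is not restated in this section, so this is an assumption). The symbols $\hat\beta_j$, $\ell_{\text{full}}$, $\ell_{\text{reduced}}$ and $k$ are notation introduced for this sketch, and the interval level $\alpha$ is left generic.

```latex
\begin{align}
\mathrm{OR}_j &= \exp(\hat\beta_j), \qquad
\mathrm{CI}_j = \exp\!\bigl(\hat\beta_j \pm z_{1-\alpha/2}\,\widehat{\mathrm{SE}}(\hat\beta_j)\bigr) \\
z_j &= \frac{\hat\beta_j}{\widehat{\mathrm{SE}}(\hat\beta_j)} \;\sim\; \mathcal{N}(0,1)
  \quad \text{under } H_0\colon \beta_j = 0
  \quad \text{(Wald test, one per non-reference category)} \\
\Lambda &= 2\bigl(\ell_{\text{full}} - \ell_{\text{reduced}}\bigr) \;\sim\; \chi^2_{k}
  \quad \text{under } H_0
  \quad \text{(likelihood ratio test, one per variable)}
\end{align}
```

Here $\ell_{\text{reduced}}$ is the maximised log-likelihood of the model with the variable dropped, and $k$ is the number of non-reference categories of that variable. For a two-level variable such as sex, $k = 1$, and the Wald and likelihood ratio p-values are expected to be close, which is the pattern seen in the Sex rows.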
Supplementary Figure S2: (a) Proportion of cases notified in each month, by year, with evidence of a seasonal peak in June. (b) Proportion of cases notified on each day of the month, by month, with a near-uniform distribution. Stratifying both panels by the introduction of the web-based ETS gives little evidence of any change in these trends.
Supplementary Figure S3: (a) Proportion of cases diagnosed in each month, by year. (b) Proportion of cases diagnosed on each day of the month, by month. Trends for the date of diagnosis were similar to those seen for notifications.
Supplementary Figure S4: (a) Proportion of cases that died in each month, by year. (b) Proportion of cases that died on each day of the month, by month. Trends for the date of death were similar to those seen for notifications, but the observed seasonality was weaker.
Supplementary Figure S5: (a) Proportion of cases starting treatment in each month, by year. (b) Proportion of cases starting treatment on each day of the month, by month. Trends for the date of starting treatment were similar to those seen for notifications.
Supplementary Figure S6: (a) Proportion of cases ending treatment in each month, by year, with a peak in December. (b) Proportion of cases ending treatment on each day of the month, by month, with some evidence of a reporting bias towards the first of the month. After the introduction of the web-based ETS, uncertainty was reduced and the bias towards the first of the month decreased.
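
The proportions plotted in Supplementary Figures S2 to S6 can be computed from a list of event dates along the following lines. This is a minimal sketch, not the code used for the figures: the function names and the example dates are hypothetical, and the real figures may additionally summarise uncertainty across years or months.

```python
from collections import Counter
from datetime import date


def monthly_proportions(dates):
    """Share of each year's events that fall in each calendar month."""
    per_year = Counter(d.year for d in dates)
    per_year_month = Counter((d.year, d.month) for d in dates)
    return {key: n / per_year[key[0]] for key, n in per_year_month.items()}


def day_of_month_proportions(dates):
    """Share of each calendar month's events that fall on each day of the month."""
    per_month = Counter(d.month for d in dates)
    per_month_day = Counter((d.month, d.day) for d in dates)
    return {key: n / per_month[key[0]] for key, n in per_month_day.items()}


# Hypothetical dates, for illustration only (not ETS data).
example_dates = [
    date(2010, 6, 1),
    date(2010, 6, 15),
    date(2010, 12, 1),
    date(2011, 6, 1),
]

monthly = monthly_proportions(example_dates)   # keys are (year, month)
by_day = day_of_month_proportions(example_dates)  # keys are (month, day)
print(monthly)
print(by_day)
```

Under uniform reporting, each day of a month would hold roughly one over the number of days in that month, so a markedly larger share at keys of the form `(month, 1)` would correspond to the first-of-the-month reporting bias described for Figure S6.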